Extending LLM usage for PDFs when the extracted text is empty after pdfminer #1285

Open
gjmveloso wants to merge 5 commits into microsoft:main from gjmveloso:feat/pdf_fallback_with_llm

@gjmveloso

Initial work on using an LLM to perform OCR on a PDF when pdfminer returns empty text.

@gjmveloso gjmveloso changed the title from "Extending LLM usage for PDFs where the extracted text was empty with pdfminer" to "Extending LLM usage for PDFs when the extracted text is empty after pdfminer" Jun 6, 2025
@gjmveloso
Author

@microsoft-github-policy-service agree

@gjmveloso gjmveloso marked this pull request as draft June 9, 2025 19:38
@gjmveloso gjmveloso marked this pull request as ready for review June 9, 2025 22:08
prompt=llm_prompt,
)

return DocumentConverterResult(markdown=str(markdown))

There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.

Author

Are you thinking of something like replacing the use of extract_text with extract_pages and iterating over its non-text elements, such as LTImage and LTFigure?

Layout system reference:
https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#topic-pdf-to-text-layout
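
As a rough illustration of the idea, the sketch below walks a pdfminer layout tree and collects every LTImage, including those nested inside LTFigure containers. It is not the code from this PR: `iter_leaves` and `images_in_pdf` are hypothetical names, and the walker assumes only that layout containers are iterable while leaf elements are not.

```python
from typing import Any, Iterator, Type


def iter_leaves(element: Any, leaf_type: Type) -> Iterator[Any]:
    """Depth-first walk yielding every descendant that is an instance of leaf_type."""
    if isinstance(element, leaf_type):
        yield element
        return
    # Containers (pages, figures, text boxes) are iterable; leaf elements are not.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from iter_leaves(child, leaf_type)


def images_in_pdf(path: str) -> list:
    """Hypothetical helper: collect all LTImage objects from every page of a PDF."""
    # Imported lazily so the walker above can be exercised without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    found = []
    for page_layout in extract_pages(path):
        found.extend(iter_leaves(page_layout, LTImage))
    return found
```

Passing the leaf type as a parameter keeps the traversal independent of pdfminer, so the same walker could also collect LTFigure elements if whole figures rather than individual images are wanted.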

Yes - that would allow a much more reliable, predictable, and comprehensive text extraction.

Author

Finally, done. Thoughts?

- Proper handling of file_stream positioning after an empty result from pdfminer
- Resolve merge conflicts that were baked into the previous commits
- Add llm_caption import and two prompt constants (_PDF_IMAGE_LLM_PROMPT,
  _PDF_FULL_LLM_PROMPT) to avoid inline prompt strings
- Add _collect_lt_images() and _get_lt_image_data() helpers for extracting
  JPEG/JPEG2000 image data from pdfminer LTImage objects; use pdfminer's own
  LITERALS_DCT_DECODE / LITERALS_JPX_DECODE for filter comparison instead of
  fragile PSLiteral string conversion
- When no form pages are detected, use pdfminer extract_text for prose quality,
  then do a second pass with extract_pages to find LTFigure elements containing
  embedded images and caption each one via the LLM
- Add last-resort whole-document LLM fallback for fully non-searchable PDFs
  where no captionable images were found
- Guard _merge_partial_numbering_lines call against None return from llm_caption
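
Two of the points above, rewinding file_stream after an empty pdfminer result and guarding against a None caption, can be sketched in isolation. This is a minimal illustration under stated assumptions, not the PR's implementation: `convert_with_fallback`, `extract_fn`, and `caption_fn` are hypothetical names, and the real signatures of extract_text and llm_caption in markitdown may differ.

```python
import io
from typing import BinaryIO, Callable, Optional


def convert_with_fallback(
    file_stream: BinaryIO,
    extract_fn: Callable[[BinaryIO], str],
    caption_fn: Callable[[BinaryIO], Optional[str]],
) -> str:
    """Try text extraction first; fall back to a captioner if the result is empty."""
    start = file_stream.tell()
    text = extract_fn(file_stream)
    if text and text.strip():
        return text
    # The extractor has consumed the stream, so rewind before the second reader.
    file_stream.seek(start)
    caption = caption_fn(file_stream)
    # The captioner may return None (assumed here); normalise that to "".
    return caption or ""


if __name__ == "__main__":
    # Stand-in callables: an extractor that finds nothing and a stub captioner.
    stream = io.BytesIO(b"%PDF-fake")
    result = convert_with_fallback(
        stream,
        lambda s: (s.read(), "")[1],
        lambda s: f"captioned {len(s.read())} bytes",
    )
    print(result)
```

Recording the starting offset with tell(), rather than seeking to 0, keeps the helper correct when the caller hands over a stream that is already positioned past a header.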
@gjmveloso gjmveloso force-pushed the feat/pdf_fallback_with_llm branch from c83bacc to 6742995 Compare April 7, 2026 18:43
@gjmveloso gjmveloso requested a review from dillonstreator April 7, 2026 18:50